272 research outputs found
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
We investigate projection methods for evaluating a linear approximation of
the value function of a policy in a Markov Decision Process context. We
consider two popular approaches, the one-step Temporal Difference fixed-point
computation (TD(0)) and Bellman Residual (BR) minimization. We describe
examples where each method outperforms the other. We highlight a simple
relation between the objective functions they respectively minimize, and show
that while BR enjoys a performance guarantee, TD(0) does not in general. We
then propose a unified view in terms of oblique projections of the Bellman
equation, which substantially simplifies and extends the characterization of
Schoknecht (2002) and the recent analysis of Yu and Bertsekas (2008). Finally,
we describe some simulations suggesting that although the TD(0) solution is
usually slightly better than the BR solution, its inherent numerical
instability makes it very poor in some cases, and thus worse on average.
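For illustration, here is a minimal numerical sketch of the two projection
methods compared above (the toy chain, feature matrix, and uniform weighting
distribution are assumptions made for this example, not taken from the paper):
the TD(0) weights solve the projected fixed-point equation, while the BR
weights minimize the weighted Bellman residual in a least-squares sense.

    import numpy as np

    # Toy policy-evaluation problem: n states, linear features Phi (n x k).
    rng = np.random.default_rng(0)
    n, k, gamma = 20, 5, 0.95
    P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # transitions of the fixed policy
    R = rng.random(n)                                          # expected rewards
    Phi = rng.random((n, k))                                   # feature matrix
    D = np.diag(np.full(n, 1.0 / n))                           # weighting distribution (uniform here)

    # TD(0) fixed point: solve Phi^T D (Phi - gamma * P Phi) theta = Phi^T D R
    theta_td = np.linalg.solve(Phi.T @ D @ (Phi - gamma * P @ Phi), Phi.T @ D @ R)

    # Bellman Residual minimization: weighted least squares on (Phi - gamma * P Phi) theta ~ R
    M = Phi - gamma * P @ Phi
    theta_br = np.linalg.solve(M.T @ D @ M, M.T @ D @ R)

    # Compare both approximations to the exact value function
    V = np.linalg.solve(np.eye(n) - gamma * P, R)
    for name, th in [("TD(0)", theta_td), ("BR", theta_br)]:
        err = Phi @ th - V
        print(name, "weighted error:", np.sqrt(err @ D @ err))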
Approximate Policy Iteration Schemes: A Comparison
We consider the infinite-horizon discounted optimal control problem
formalized by Markov Decision Processes. We focus on several approximate
variations of the Policy Iteration algorithm: Approximate Policy Iteration
(API), Conservative Policy Iteration (CPI), a natural adaptation of the Policy
Search by Dynamic Programming algorithm to the infinite-horizon case
(PSDP$_\infty$), and the recently proposed Non-Stationary Policy Iteration
(NSPI(m)). For all algorithms, we describe performance bounds, and make a
comparison paying particular attention to the concentrability constants
involved, the number of iterations and the memory required. Our analysis
highlights the following points: 1) The performance guarantee of CPI can be
arbitrarily better than that of API/API($\alpha$), but this comes at the cost
of a relative increase of the number of iterations that is exponential in
$1/\epsilon$. 2) PSDP$_\infty$ enjoys the best of both worlds: its performance
guarantee is similar to that of CPI, but within a number of iterations similar
to that of API. 3) Contrary to API, which requires constant memory, the memory
needed by CPI and PSDP$_\infty$ is proportional to their number of iterations,
which may be problematic when the discount factor $\gamma$ is close to 1 or
the approximation error $\epsilon$ is close to 0; we show that the NSPI(m)
algorithm allows one to make an overall trade-off between memory and
performance. Simulations with these schemes confirm our analysis. Comment:
ICML 2014.
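To make the difference between API and CPI concrete, the following tabular
sketch shows the conservative mixture step at the heart of CPI (Kakade and
Langford, 2002); the step size alpha, the toy dimensions, and the random
stand-in for the approximate Q-function are illustrative assumptions.

    import numpy as np

    def cpi_mixture_step(pi, q_hat, alpha):
        # Conservative update: mix the current stochastic policy with the greedy
        # policy of an approximate Q-function, pi <- (1 - alpha) * pi + alpha * greedy.
        n_states, n_actions = pi.shape
        greedy = np.zeros_like(pi)
        greedy[np.arange(n_states), q_hat.argmax(axis=1)] = 1.0
        return (1.0 - alpha) * pi + alpha * greedy

    # Illustrative usage on random data
    rng = np.random.default_rng(1)
    pi = np.full((4, 3), 1.0 / 3)              # uniform initial policy: 4 states, 3 actions
    q_hat = rng.random((4, 3))                 # stand-in for an approximate Q-function of pi
    pi = cpi_mixture_step(pi, q_hat, alpha=0.1)
    print(pi.sum(axis=1))                      # each row remains a probability distribution

In this view, standard API corresponds to taking alpha = 1, i.e. replacing the
policy entirely at each iteration, which is why its memory footprint stays
constant while CPI must keep every component of the growing mixture.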
On the Performance Bounds of some Policy Search Dynamic Programming Algorithms
We consider the infinite-horizon discounted optimal control problem
formalized by Markov Decision Processes. We focus on Policy Search algorithms,
that compute an approximately optimal policy by following the standard Policy
Iteration (PI) scheme via an $\epsilon$-approximate greedy operator (Kakade and Langford,
2002; Lazaric et al., 2010). We describe existing and a few new performance
bounds for Direct Policy Iteration (DPI) (Lagoudakis and Parr, 2003; Fern et
al., 2006; Lazaric et al., 2010) and Conservative Policy Iteration (CPI)
(Kakade and Langford, 2002). Paying particular attention to the
concentrability constants involved in such guarantees, we notably argue that
the guarantee of CPI is much better than that of DPI, but this comes at the
cost of a relative increase of time complexity that is exponential in
$1/\epsilon$. We then describe an algorithm, Non-Stationary Direct Policy
Iteration (NSDPI), that can be seen either as 1) an adaptation of Policy
Search by Dynamic Programming (Bagnell et al., 2003) to the infinite-horizon
setting or 2) a simplified version of the Non-Stationary PI with growing
period of Scherrer and Lesner (2012). We provide an analysis of this algorithm
showing, in particular, that it enjoys the best of both worlds: its
performance guarantee is similar to that of CPI, but within a time complexity
similar to that of DPI.
Rate of Convergence and Error Bounds for LSTD($\lambda$)
We consider LSTD($\lambda$), the least-squares temporal-difference algorithm
with eligibility traces proposed by Boyan (2002). It computes a linear
approximation of the value function of a fixed policy in a large Markov
Decision Process. Under a $\beta$-mixing assumption, we derive, for any value
of $\lambda$, a high-probability estimate of the rate of convergence of this
algorithm to its limit. We deduce a high-probability bound on the error of
this algorithm that extends (and slightly improves) the one derived by Lazaric
et al. (2012) in the specific case where $\lambda = 0$. In particular, our
analysis sheds some light on the choice of $\lambda$ with respect to the
quality of the chosen linear space and the number of samples, and is
consistent with simulations. Comment: 2014.
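For concreteness, here is a minimal batch implementation of the estimator
analysed above, run on a single simulated trajectory (the toy chain, features,
and sample size are assumptions made for illustration).

    import numpy as np

    def lstd_lambda(states, rewards, phi, gamma, lam):
        # Batch LSTD(lambda) on one trajectory: accumulate eligibility traces z_t,
        # build A = sum_t z_t (phi_t - gamma * phi_{t+1})^T and b = sum_t z_t r_t,
        # and return theta = A^{-1} b.
        k = phi.shape[1]
        A, b, z = np.zeros((k, k)), np.zeros(k), np.zeros(k)
        for t in range(len(rewards)):
            phi_t, phi_next = phi[states[t]], phi[states[t + 1]]
            z = gamma * lam * z + phi_t
            A += np.outer(z, phi_t - gamma * phi_next)
            b += z * rewards[t]
        return np.linalg.solve(A, b)

    # Illustrative usage: random walk on a small chain with random features
    rng = np.random.default_rng(2)
    n, k, gamma, lam, T = 10, 4, 0.9, 0.5, 5000
    P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
    R = rng.random(n)
    phi = rng.random((n, k))
    states = [0]
    for _ in range(T):
        states.append(rng.choice(n, p=P[states[-1]]))
    rewards = [R[s] for s in states[:-1]]
    theta = lstd_lambda(states, rewards, phi, gamma, lam)
    print("LSTD(lambda) weights:", np.round(theta, 3))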
Policy Search: Any Local Optimum Enjoys a Global Performance Guarantee
Local Policy Search is a popular reinforcement learning approach for handling
large state spaces. Formally, it searches locally in a parameterized policy
space in order to maximize the associated value function averaged over some
predefined distribution. It is probably commonly believed that the best one
can hope for in general from such an approach is a local optimum of this
criterion. In this article, we show the following surprising result:
\emph{any} (approximate) \emph{local optimum} enjoys a \emph{global
performance guarantee}. We compare this guarantee with the one satisfied by
Direct Policy Iteration, an approximate dynamic programming algorithm that
does some form of Policy Search: while the approximation error of Local Policy
Search may generally be bigger (because local search requires considering a
space of stochastic policies), we argue that the concentrability coefficient
appearing in its performance bound is much nicer. Finally, we discuss several
practical and theoretical consequences of our analysis.
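For reference, the criterion that Local Policy Search maximizes can be written
as follows, where $\nu$ denotes the predefined distribution and $v_{\pi_\theta}$
the value function of the parameterized stochastic policy $\pi_\theta$ (a
standard formulation, with notation chosen here rather than fixed by the
abstract):

    J_\nu(\theta) \;=\; \mathbb{E}_{s \sim \nu}\!\left[ v_{\pi_\theta}(s) \right]

The result above says that any (approximate) local maximum of this criterion
comes with a global guarantee on its gap to the optimal value function.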
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
We consider infinite-horizon stationary $\gamma$-discounted Markov Decision
Processes, for which it is known that there exists a stationary optimal policy.
Using Value and Policy Iteration with some error $\epsilon$ at each iteration,
it is well-known that one can compute stationary policies that are
$\frac{2\gamma}{(1-\gamma)^2}\epsilon$-optimal. After arguing that this
guarantee is tight, we develop variations of Value and Policy Iteration for
computing non-stationary policies that can be up to
$\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant
improvement in the usual situation when $\gamma$ is close to 1. Surprisingly,
this shows that the problem of "computing near-optimal non-stationary policies"
is much simpler than that of "computing near-optimal stationary policies".
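A minimal tabular sketch of the non-stationary construction (with exact
backups for readability; the random MDP and the horizon m are illustrative
assumptions): run m steps of Value Iteration, keep every intermediate greedy
policy, and control the system by cycling through that sequence instead of
using only the last policy.

    import numpy as np

    def vi_nonstationary_policies(P, R, gamma, m):
        # m backups of Value Iteration, keeping every greedy policy along the way.
        # Cycling through these policies (most recent first) is the non-stationary
        # policy considered above, instead of keeping only the last one.
        V = np.zeros(P.shape[0])
        policies = []
        for _ in range(m):
            Q = R + gamma * P @ V          # Q[s, a], with P of shape (S, A, S)
            policies.append(Q.argmax(axis=1))
            V = Q.max(axis=1)
        return policies[::-1], V           # play policies[0], policies[1], ..., then repeat

    # Illustrative usage on a random MDP
    rng = np.random.default_rng(3)
    S, A, gamma = 6, 2, 0.95
    P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
    R = rng.random((S, A))
    policies, V = vi_nonstationary_policies(P, R, gamma, m=4)
    print("greedy policies, most recent first:", policies)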
Tight Performance Bounds for Approximate Modified Policy Iteration with Non-Stationary Policies
We consider approximate dynamic programming for the infinite-horizon
stationary $\gamma$-discounted optimal control problem formalized by Markov
Decision Processes. While in the exact case it is known that there always
exists an optimal policy that is stationary, we show that when using value
function approximation, looking for a non-stationary policy may lead to a
better performance guarantee. We define a non-stationary variant of MPI that
unifies a broad family of approximate DP algorithms of the literature. For this
algorithm we provide an error propagation analysis in the form of a performance
bound of the resulting policies that can improve the usual performance bound by
a factor $O(1-\gamma)$, which is significant when the discount factor $\gamma$
is close to 1. In doing so, our approach unifies recent results for Value and
Policy Iteration. Furthermore, we show, by constructing a specific
deterministic MDP, that our performance guarantee is tight.
A Theory of Regularized Markov Decision Processes
Many recent successful (deep) reinforcement learning algorithms make use of
regularization, generally based on entropy or Kullback-Leibler divergence. We
propose a general theory of regularized Markov Decision Processes that
generalizes these approaches in two directions: we consider a larger class of
regularizers, and we consider the general modified policy iteration approach,
encompassing both policy iteration and value iteration. The core building
blocks of this theory are a notion of regularized Bellman operator and the
Legendre-Fenchel transform, a classical tool of convex optimization. This
approach allows for error propagation analyses of general algorithmic schemes
of which (possibly variants of) classical algorithms such as Trust Region
Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy
Programming are special cases. This also draws connections to proximal convex
optimization, especially to Mirror Descent. Comment: ICML 2019.
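To make the notion of regularized Bellman operator concrete, here is a minimal
sketch of the entropy-regularized ("soft") special case, where the
Legendre-Fenchel transform of the scaled negative entropy replaces the max
over actions by a log-sum-exp (the tabular setting and the temperature tau are
illustrative assumptions, not the paper's general formulation).

    import numpy as np
    from scipy.special import logsumexp

    def soft_bellman_backup(V, P, R, gamma, tau):
        # Entropy-regularized optimal backup: the Legendre-Fenchel transform of the
        # (scaled) negative entropy turns max_a Q(s, a) into tau * logsumexp(Q(s, .) / tau),
        # which recovers the usual max as tau -> 0.
        Q = R + gamma * P @ V              # Q[s, a], with P of shape (S, A, S)
        return tau * logsumexp(Q / tau, axis=1)

    # Illustrative usage: iterate the regularized operator to (near) convergence
    rng = np.random.default_rng(4)
    S, A, gamma, tau = 5, 3, 0.9, 0.1
    P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
    R = rng.random((S, A))
    V = np.zeros(S)
    for _ in range(200):
        V = soft_bellman_backup(V, P, R, gamma, tau)
    print("regularized optimal values:", np.round(V, 3))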